Conversation

@toillium toillium commented Nov 16, 2025

What does this PR do?

Makes the ClickHouse client thread-safe by disabling autogenerate_session_id.

Relevant docs: https://clickhouse.com/docs/integrations/language-clients/python/driver-api#multi-threaded-applications
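
For reference, a minimal sketch of what the change amounts to with clickhouse-connect (connection parameters here are placeholders; the integration builds its client from the instance config):

```python
import clickhouse_connect
from clickhouse_connect import common

# Per the linked docs on multi-threaded applications: disable automatic
# session id generation *before* creating the client, so queries issued from
# multiple threads don't all carry the same session id.
common.set_setting('autogenerate_session_id', False)

# Placeholder connection parameters.
client = clickhouse_connect.get_client(host='localhost', username='default', password='')
```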

Motivation

#21885

@toillium
Author

cc @sarah-witt @steveny91

codecov bot commented Nov 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.03%. Comparing base (8c6c226) to head (a4c8772).
⚠️ Report is 1 commit behind head on master.


@toillium toillium force-pushed the joe/disable-client-session-ids branch 2 times, most recently from 624bfd2 to f15e379 on November 16, 2025 at 18:34
@toillium toillium force-pushed the joe/disable-client-session-ids branch from f15e379 to 68d1ce7 on November 16, 2025 at 18:35
Contributor

@AAraKKe AAraKKe left a comment


Thanks for the PR @toillium! I am not 100% sure why this would be an issue since, as far as I can tell, the check runs synchronously. Would you mind sharing where you think the multi-threaded behavior is?

The error you report seeing is

HTTPDriver for <URL> received ClickHouse error code 373\n Code: 373. DB::Exception: Session 7de98854-bff0-481b-a959-aa18f3ec9e4d is locked by a concurrent client. (SESSION_IS_LOCKED)

This means that the error is coming from the server returning 373, right? Is it possible that the query launched against ClickHouse is timing out, but we reuse the client and launch a new query against the server, causing the session-already-used error?

I am a bit worried that by not generating the session ID we might be hiding an issue with timeouts, for instance, that we should be surfacing to customers.

My Feedback Legend

Here's a quick guide to the prefixes I use in my comments:

praise: no action needed, just celebrate!
note: just a comment/information, no need to take any action.
question: I need clarification or I'm seeking to understand your approach.
nit: A minor, non-blocking issue (e.g., style, typo). Feel free to ignore.
suggestion: I'm proposing an improvement. This is optional but recommended.
request: A change I believe is necessary before this can be merged.

The only blocking comments are requests; any other type of comment can be applied at the discretion of the developer.

@toillium
Author

Would you mind sharing where do you think the multi-threaded behavior is?

I think the multithreaded behavior is coming from the Datadog agent's query scheduling.

This means that the error is coming from the server returning 373 right? Is it possible that somehow the query launched against clickhouse is timing out somehow but we reuse the client and launch a new query against the server, causing the session already used error?

I think the issue is that query A launches with session id foo, then shortly after, query B launches with session id foo while query A is still running (which implies multithreading is being used). Query B times out while waiting for the lock on the session id, and query A completes some time later.

I am bit worried that by not generating the session ID we might be hiding an issue with timeouts for instance that we should be resurfacing to customers.

I don't think the session id impacts how timeouts are surfaced. My understanding is that it's only used to determine the lifespan of temporary resources on the ClickHouse server (e.g. temporary tables).
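
For illustration, here is a hedged, standalone sketch of that race (not the integration's code; the connection details and the sleep(3) query are placeholders): two threads share one client that still autogenerates a session id, so the query that arrives while the other is still running should be rejected with code 373.

```python
import threading
import clickhouse_connect

# Default behavior: the client generates one session id and attaches it to
# every query it sends.
client = clickhouse_connect.get_client(host='localhost', username='default', password='')

def run_slow_query():
    try:
        # Both threads send the same session id; the one that arrives while the
        # other is still executing should fail with SESSION_IS_LOCKED (373).
        client.query('SELECT sleep(3)')
    except Exception as exc:
        print(f'query failed: {exc}')

threads = [threading.Thread(target=run_slow_query) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```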

@toillium toillium force-pushed the joe/disable-client-session-ids branch from 9fe1a22 to 9cc9c01 on November 24, 2025 at 15:44
@toillium toillium force-pushed the joe/disable-client-session-ids branch from 37e509e to 3b6a98e on November 24, 2025 at 16:03
@joe-clickhouse

Hi all, Joe from clickhouse here (maintainer of clickhouse-connect). Setting autogenerate_session_id=False should indeed resolve the SESSION_IS_LOCKED errors. This makes the client "safe enough" for typical concurrent query usage. Just note that it's not FULLY thread-safe. e.g. avoid calling methods like set_client_setting() or similar concurrently with queries. But for the use case of concurrent queries without modifying client state, you should be good to go.

Hope that helps! Happy to help provide more context or answer any other clickhouse-connect questions as well.
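
As a rough sketch of the "safe enough" pattern described above (settings fixed at construction, no client mutation once threads are running; connection details and queries are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import clickhouse_connect
from clickhouse_connect import common

# Configure everything up front; after this point the client is only used for
# queries and never mutated (no set_client_setting() or similar).
common.set_setting('autogenerate_session_id', False)
client = clickhouse_connect.get_client(host='localhost', username='default', password='')

queries = ['SELECT 1', 'SELECT count() FROM system.tables']  # placeholder queries

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda q: client.query(q).result_rows, queries))
print(results)
```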

@toillium toillium requested a review from AAraKKe November 25, 2025 18:43
@AAraKKe
Contributor

AAraKKe commented Nov 27, 2025

Thanks guys!

I think the multithreaded behavior is coming from datadog agent's query scheduling.

That is my point: the agent itself does not manage any queries; this is done at the integration level, and the integration itself is synchronous. What I am afraid is happening is the following; @joe-clickhouse, maybe you can correct me if this is not how the client behaves.

  • The agent calls the check.
  • The client is created with a given send_receive_timeout and connect_timeout.
  • Query manager sends 20 queries, sequentially.
  • Before the queries finish executing, we hit the client timeout and stop listening, but the queries are still running in ClickHouse.
  • Next time the agent calls the check, since we already have the client defined (with a given session id), we reuse it to launch the same queries again.
  • ClickHouse receives a query with the same session as one that is already running and returns 373.

I am not sure if ClickHouse will cancel ongoing queries if the client times out. I have worked with datastores where a query is neither cancelled nor marked as stale if the client disconnects. In that sense, when we call the datastore we can run into a scenario in which a pending query is blocking the session.

If this is possible with ClickHouse, I am worried that customers have queries timing out that they do not know about. That is what I would try to address.

@joe-clickhouse, do you think this is possible? While removing the session id generation is a fix for the errors and we should get it merged, I would like to understand whether this scenario is possible and whether we should improve our error handling. Also, whether there is any way to ensure ClickHouse cancels an ongoing query; these could start accumulating, generating load on the cluster for queries that no one is going to listen to.

@joe-clickhouse
Copy link

joe-clickhouse commented Dec 1, 2025

Hi @AAraKKe, yes, this is definitely a possible scenario. So if that's the use case, then this is definitely a valid concern.

So, in its current state, if we assume this scenario:

  1. agent calls the check and a client is created with a session_id (default behavior)
  2. client sends queries sequentially
  3. client timeout occurs and the client will stop waiting for a response
  4. queries will continue running on the CH server
  5. next agent check runs and reuses the same client with the same session_id

This will cause new queries to collide with the still-running queries and raise 373 errors. This PR prevents automatic client-side session_id generation, which will fix (or prevent) the 373 errors from occurring.

That said, @AAraKKe, your instinct is correct: if the client times out, the server will not (by default) cancel the query, so query accumulation is a valid concern if you're having client connection/timeout issues and submitting the same queries multiple times.

There are a few things we could do to mitigate these kinds of potential issues, however:

  1. The most reliable approach would probably be setting the server-side setting max_execution_time (see docs) to a value slightly less (maybe 10 seconds or so less) than the client-side setting send_receive_timeout (default of 300s). This should prevent any queries from ever reaching the client timeout (see the sketch after this list). This also looks safe in datadog's QueryExecutor.
  2. This one is less reliable, I think, but on the server we can set cancel_http_readonly_queries_on_client_close=1 (see docs), which tells the server to cancel the query if it detects that the connection has closed. Note that this isn't 100% guaranteed to cancel the query immediately: CH will cancel the query when it detects the connection close, but if it's deep in query execution and not frequently checking for client disconnects, there may be a delay before cancellation actually happens.
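
A hedged sketch of how both suggestions could be applied through clickhouse-connect client settings (timeout values and connection details are placeholders, not what the integration ships):

```python
import clickhouse_connect

SEND_RECEIVE_TIMEOUT = 30  # placeholder client-side timeout in seconds

client = clickhouse_connect.get_client(
    host='localhost',        # placeholder connection details
    username='default',
    password='',
    send_receive_timeout=SEND_RECEIVE_TIMEOUT,
    settings={
        # 1. Keep the server-side limit below the client timeout so the server
        #    kills the query before the client gives up waiting for it.
        'max_execution_time': SEND_RECEIVE_TIMEOUT - 10,
        # 2. Best effort: cancel read-only HTTP queries when the server notices
        #    the client connection has closed.
        'cancel_http_readonly_queries_on_client_close': 1,
    },
)
```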

Side note: I think your scenario is mostly theoretical, since the default client-side timeout is 5 minutes. It'd be pretty rare for correctly designed and working queries to regularly hit that limit, so if for some reason they are, it's worth a separate discussion.

Hope that helps!

@AAraKKe
Contributor

AAraKKe commented Dec 4, 2025

Thanks @joe-clickhouse! I appreciate your help with this; I will follow up with the team and come up with a solution. What I think would be the best approach is:

  • Add a client_timeout metric to the integration so customers know their client is timing out and can reconfigure timeouts or investigate further (see the rough sketch after this list).
  • Update the documentation, following your explanation in your first bullet point, to help them understand what can be done to improve this.
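
For the first bullet, a rough sketch of what emitting such a metric could look like inside the check (hypothetical metric name, attribute names, and exception handling; the concrete timeout exception raised by clickhouse-connect would need to be confirmed):

```python
# Hypothetical sketch, not actual integration code.
def _run_query(self, query):
    try:
        return self._client.query(query)
    except Exception as exc:
        # Assumption: a client-side timeout surfaces here; the exact
        # clickhouse-connect exception class should be checked instead of
        # matching on the message text.
        if 'timeout' in str(exc).lower():
            self.count('clickhouse.client_timeout', 1, tags=self._tags)
        raise
```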

Regarding the default timeout, that is the default on the clickhouse_connect client side, but the integration defaults to 10s (or whatever the customer sets in their integration config), so it is pretty much in the customer's hands whether they land in this scenario. See here.

I am approving the PR to prevent the 373 errors for now, and we will update the integration to handle this appropriately.

Thanks guys!

@AAraKKe AAraKKe added this pull request to the merge queue Dec 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Dec 4, 2025